Modelling and Mining of Networked Information Spaces

Authors

  • William Aiello
  • Andrei Z. Broder
  • Jeannette C. M. Janssen
  • Evangelos E. Milios
Abstract

The Web as a text corpus
  • Pages close in word vector space tend to be related
  • Cluster hypothesis (van Rijsbergen 1979)
  • The WebCrawler (Pinkerton 1994)
  • The whole first generation of search engines
[Figure: pages p_1 and p_2 in a vector space with term axes "weapons", "mass", "destruction"]

Enter the Web's link structure (Broder et al. 2000)
$$p(i) = \frac{\alpha}{N} + (1 - \alpha) \sum_{j:\, j \to i} p(j)$$
[Figure: text, links, meaning]

Connection between semantic topology (topicality or relevance) and link topology (hypertext)
  • $G = \Pr[\mathrm{rel}(p)]$ ~ fraction of relevant pages (generality)
  • $R = \Pr[\mathrm{rel}(p) \mid \mathrm{rel}(q) \wedge \mathrm{link}(q,p)]$
  • Related nodes are "clustered" if $R > G$ (modularity)
  • Necessary and sufficient condition for a random crawler to find pages related to start points

Link-cluster conjecture
  • Stationary hit rate for a random crawler:
$$\eta(t+1) = \eta(t) \cdot R + (1 - \eta(t)) \cdot G \ge \eta(t)$$
  • Limit of $\eta(t)$ as $t \to \infty$: value added

Conjecture
  • Pages that link to each other tend to be related
  • Preservation of semantics (meaning)
  • A.k.a. topic drift

Link-cluster conjecture
$$L(q,\delta) \equiv \frac{\sum_{\{p:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{path}(q,p)}{|\{p : \mathrm{path}(q,p) \le \delta\}|}$$
$$\frac{R(q,\delta)}{G(q)} \equiv \frac{\Pr[\mathrm{rel}(p) \mid \mathrm{rel}(q) \wedge \mathrm{path}(q,p) \le \delta]}{\Pr[\mathrm{rel}(p)]}$$
JASIST 2004

Correlation of lexical and linkage topology
  • $L(\delta)$: average link distance
  • $S(\delta)$: average similarity to start (topic) page from pages up to distance $\delta$
  • Correlation $\rho(L,S) = -0.76$

The "link-content" conjecture
$$S(q,\delta) \equiv \frac{\sum_{\{p:\, \mathrm{path}(q,p) \le \delta\}} \mathrm{sim}(q,p)}{|\{p : \mathrm{path}(q,p) \le \delta\}|}$$

Heterogeneity of link-content correlation
$$S = c + (1 - c)\, e^{a L^{b}}$$
[Figure: results by domain (edu, net, gov, com, org); significant difference in a only (p<0.05) or in a and b (p<0.05)]

Discussion
  • Topic drift: Myth or reality?

Mapping the relationship between links, content, and semantic topologies
  • Given any pair of pages, need 'similarity' or 'proximity' metric for each topology:
    – Content: textual/lexical (cosine) similarity
    – Link: co-citation/bibliographic coupling
    – Semantic: relatedness inferred from manual classification
  • Data: Open Directory Project (dmoz.org)
    – ~1 M pages after cleanup
    – ~$1.3 \times 10^{12}$ page pairs!
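The random-crawler recurrence η(t+1) = η(t)·R + (1 − η(t))·G can be iterated numerically. The following is a minimal sketch: the function names `hit_rate_trajectory` and `stationary_hit_rate` are hypothetical, and the values of R and G are illustrative assumptions, not figures from the source. Setting η(t+1) = η(t) in the recurrence gives the fixed point η* = G / (1 − R + G), which the iteration approaches whenever |R − G| < 1.

```python
def hit_rate_trajectory(R, G, eta0, steps):
    """Iterate eta(t+1) = eta(t)*R + (1 - eta(t))*G and return every value."""
    etas = [eta0]
    for _ in range(steps):
        eta = etas[-1]
        etas.append(eta * R + (1.0 - eta) * G)
    return etas


def stationary_hit_rate(R, G):
    """Fixed point of the recurrence: eta* = G / (1 - R + G)."""
    return G / (1.0 - R + G)


# Illustrative values only (assumed, not taken from the source).
R, G = 0.5, 0.1
trajectory = hit_rate_trajectory(R, G, eta0=G, steps=50)
eta_star = stationary_hit_rate(R, G)
print(trajectory[-1], eta_star)
```

With R > G and a start value of η(0) = G, the sequence does not decrease and settles above G, which is the sense in which a link-following crawler gains over random sampling.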
Content similarity
$$\sigma_c(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \cdot \|p_2\|}$$
[Figure: pages p_1 and p_2 as vectors over axes term i, term j, term k]
$\sigma_l(p_1, p_2) = |U_{p_1} \cap U_{p_2}| \,/\, |U_{p_1} \cup$ …
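The content similarity σ_c is the cosine of the angle between two page vectors. The sketch below is one way to compute it, under assumptions that the source does not state: pages are represented by raw term-frequency vectors over lower-cased whitespace tokens, and the helper names `term_vector` and `cosine_similarity` are hypothetical. Other term weightings would change the numbers but not the formula.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting: raw counts)."""
    return Counter(text.lower().split())


def cosine_similarity(p1, p2):
    """sigma_c(p1, p2) = (p1 . p2) / (||p1|| * ||p2||); 0.0 if a vector is empty."""
    dot = sum(weight * p2[term] for term, weight in p1.items())
    norm1 = math.sqrt(sum(w * w for w in p1.values()))
    norm2 = math.sqrt(sum(w * w for w in p2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy pages (illustrative text, not data from the source).
page_a = term_vector("weapons mass destruction")
page_b = term_vector("mass destruction report")
similarity = cosine_similarity(page_a, page_b)
print(similarity)
```

The two toy pages share two of their three terms, so their similarity falls between 0 (no shared terms) and 1 (identical direction).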


Similar resources

Modelling and Compensation of uncertain time-delays in networked control systems with plant uncertainty using an Improved RMPC Method

Control systems with digital communication between sensors, controllers and actuators are called Networked Control Systems (NCSs). In general, NCSs encounter problems such as packet dropouts and network-induced delays. When plant uncertainty is added to these problems, designing a robust controller that can guarantee stability becomes more complex. In...

Fundamentals of 3D modelling and resource estimation in coal mining

The prerequisite of maintaining an efficient and safe mining operation is the proper design of a mine by considering all aspects. The first step in a coal mine design is a realistic geometrical modelling of the coal seam(s). The structural features such as faults and folding must be reliably implemented in 3D seam models. Upon having a consistent seam model, the attributes such as calorific val...

FUSION FRAMES IN HILBERT SPACES

Fusion frames are an extension of frames that provides a framework for applications and for efficient and robust information processing algorithms. In this article we study the erasure of subspaces of a fusion frame.

GIS modelling for Au-Pb-Zn potential mapping in Torud-Chah Shirin area-Iran

One of the major strengths of a Geographic Information System (GIS) in geosciences is the ability to integrate and combine multiple layers into mineral potential maps showing areas which are favorable for mineral exploration. These capabilities make GIS an extremely useful tool for mineral exploration. Several spatial modeling techniques can be employed to produce potential maps. However, these...

Calculation of One-dimensional Forward Modelling of Helicopter-borne Electromagnetic Data and a Sensitivity Matrix Using Fast Hankel Transforms

The helicopter-borne electromagnetic (HEM) frequency-domain exploration method is an airborne electromagnetic (AEM) technique that is widely used for vast and rough areas for resistivity imaging. The vast amount of digitized data flowing from the HEM method requires an efficient and accurate inversion algorithm. Generally, the inverse modelling of HEM data in the first step requires a precise a...

A practical approach to open-pit mine planning under price uncertainty using information gap decision theory

In the context of open-pit mine planning, uncertainties such as commodity price significantly affect the technical and financial aspects of mining projects. A mine plan developed without regard to price uncertainty is optimized only for the start of the mining operation. Given the price change over the life of mine, which is quite certain, optimality o...

Publication date: 2006